Thai Word Segmentation Verification Tool

Authors

  • Supon Klaithin
  • Kanyanut Kriengket
  • Sitthaa Phaholphinyo
  • Krit Kosawat
Abstract

Because Thai has no explicit word boundaries, word segmentation is the first step in developing any Thai NLP application. Building the large word-segmented corpora needed to train a word segmentation model requires an efficient verification tool that helps linguists check the accuracy and consistency of those corpora more conveniently. This paper presents Thai Word Segmentation Verification Tool Version 2.0, which improves significantly on Version 1.0 in many respects. By using hash tables in its data structures, the new version runs faster and more stably. The user interface has also been redesigned to be more user-friendly. The paper describes the new data structures and the changes to the user interface. An experimental evaluation against the previous version shows improvement in every aspect.
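The abstract does not spell out the tool's actual data structures, so the following is only a minimal sketch of how a hash table can support consistency checking in a word-segmented corpus. It assumes a hypothetical corpus format in which words on each line are separated by `|`; the function names `build_index` and `find_split_conflicts` are illustrative and do not come from the paper.

```python
from collections import defaultdict

def build_index(lines, sep="|"):
    """Map each word to the (line, position) pairs where it occurs.

    The separator and corpus layout are assumptions made for this sketch.
    """
    index = defaultdict(list)
    for line_no, line in enumerate(lines):
        words = [w for w in line.split(sep) if w]
        for pos, word in enumerate(words):
            index[word].append((line_no, pos))
    return index

def find_split_conflicts(lines, index, sep="|"):
    """Report adjacent word pairs whose concatenation occurs as one word elsewhere."""
    conflicts = []
    for line_no, line in enumerate(lines):
        words = [w for w in line.split(sep) if w]
        for pos in range(len(words) - 1):
            joined = words[pos] + words[pos + 1]
            if joined in index:  # hash lookup, constant time on average
                conflicts.append((line_no, pos, joined))
    return conflicts

# Two hypothetical corpus lines that segment the same text differently.
corpus = ["แม่น้ำ|ไหล", "แม่|น้ำ|ไหล"]
index = build_index(corpus)
conflicts = find_split_conflicts(corpus, index)
```

Because each membership test is a dictionary lookup, the scan over the corpus stays roughly linear in the number of words, which is the kind of speed-up a hash-based design would be expected to give over repeated linear searches.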

Similar resources

Thoughts on Word and Sentence Segmentation in Thai

This paper discusses problems of word and sentence segmentation in Thai. Disagreements on word segmentation are caused mostly by compound words. To set a standard resource and tool for word segmentation, we suggest that only simple words and true compound words should be segmented in the process of word segmentation. Other compounds can be grouped later by the same means as multiword identific...

A Collaborative Framework for Collecting Thai Unknown Words from the Web

We propose a collaborative framework for collecting Thai unknown words found on Web pages over the Internet. Our main goal is to design and construct a Web-based system which allows a group of interested users to participate in constructing an open dictionary of Thai unknown words. The proposed framework provides supporting algorithms and tools for automatically identifying and extracting unknown wor...

A Multi-Aspect Comparison and Evaluation on Thai Word Segmentation Programs

Word segmentation is an important task in natural language processing, especially for languages without word boundaries, such as Thai. Many Thai word segmentation programs have been developed. Researchers and developers working with Thai documents usually spend a tremendous amount of time studying and trying different Thai word segmentation programs. This paper presents the performance of six...

Non-Dictionary-Based Thai Word Segmentation Using Decision Trees

For languages without word boundary delimiters, dictionaries are needed for segmenting running texts. This makes segmentation accuracy depend significantly on the quality of the dictionary used for analysis. If the dictionary is not sufficiently good, it will lead to a great number of unknown or unrecognized words. These unrecognized words certainly reduce segmentation accuracy. To solve...

Context Sensitive Pattern Based Segmentation: A Thai Challenge

A written Thai text is a string of symbols without explicit word boundary markup. A method for developing a segmentation tool from a corpus of already segmented text is described. The methodology is based on the technology of competing patterns, which evolved from an algorithm for English hyphenation. A new UNICODE pattern generation program, OPATGEN, is used for the learning phase. We have shown ...

Publication date: 2011